Abstract

The prediction of salient areas in images has traditionally
been addressed with hand-crafted features based on
neuroscience principles. This paper, however, addresses the
problem with a completely data-driven approach by training a
convolutional network. The learning process is formulated as
the minimization of a loss function that measures the
Euclidean distance between the predicted saliency map and the
provided ground truth. The recent publication of large
datasets for saliency prediction has provided enough data to
train a not very deep architecture that is both fast and
accurate. The convolutional network in this paper, named
JuntingNet, won the LSUN 2015 challenge on saliency
prediction with superior performance in all considered
metrics.


1. Introduction 

This work presents an end-to-end convolutional network
(convnet) for saliency prediction. Our objective is to
compute saliency maps that represent the probability of visual
attention. This problem has been traditionally addressed
with hand-crafted features inspired by neurology studies.
In our case we have adopted a completely data-driven
approach, training a model with a large amount of annotated
data.

The convnet is a popular architecture in the field of deep
learning and has been widely explored for visual pattern
recognition, ranging from global-scale image classification
to more local object detection or semantic segmentation. The
hierarchy of layers in convnets is also inspired by
biological models, and recent works have pointed at a
relation between the activity of certain areas in the brain
and the hierarchy of layers in convnets [?]. Provided with
enough training data, convnets show impressive results, often
outperforming hand-crafted methods. In many popular works,
the output of the convnet is a discrete label associated with
a certain semantic class. Saliency prediction, though,
requires estimating a continuous range of values that
represent the probability of a human fixation on each pixel.
These values present a spatial coherence and smooth
transitions that this work addresses by using the convnet as
a regression solver instead of a classifier.

Figure 1. Images (right) and saliency maps (left) from the iSUN
and SALICON datasets.

The training of a convolutional network requires a large
amount of annotated data that provides a rich description of
the problem. Our work has benefited from the recent
publication of two datasets: iSUN [21] and SALICON [12]. These
datasets follow two different approaches to collecting
saliency annotations. While iSUN was generated with an
eye-tracker to annotate the gaze fixations, the SALICON
dataset was built by asking humans to click on the most
salient points of the image. The different nature of the
saliency maps of the two datasets can be seen in Figure 1. The
large size of these datasets has provided, for the first time,
the possibility of training a convnet.

Our main contribution has been the design of an end-to-end
convnet for saliency prediction, the first of its kind to
the authors' knowledge. The network, called JuntingNet, has
proved its superior performance in the Large-scale Scene
UNderstanding (LSUN) challenge 2015 [23]. The developed model
is publicly available at http://bit.ly/juntingnet.

This paper is structured as follows. Section 2 presents
previous works using convolutional networks for saliency
prediction. Our system is presented in Section 3 and its
results on the LSUN challenge are reported in Section 4. The
conclusions and future directions are contained in Section 5.

2. Related work 

JuntingNet presents the next natural step in two main
trends in deep learning: using convnets for saliency
prediction and training these networks by formulating an
end-to-end problem. This section refers to some related work
in these two fields.

2.1. Deep learning for saliency prediction 

An early attempt at predicting saliency with a convnet was
the ensemble of Deep Networks (eDN) [?], which proposed an
optimal blend of features from three different convnet
layers that were finally combined with a simple linear
classifier trained with positive (salient) or negative
(non-salient) local regions. This approach inspired DeepGaze
[?], which only combined features from different layers but,
in this case, from a much deeper network. In particular,
DeepGaze used the existing AlexNet convnet [?], which had
been trained for an object classification task, not for
saliency prediction. JuntingNet adopts a not very deep
architecture, like eDN, but it is trained end-to-end as a
regression problem, avoiding the reuse of precomputed
parameters from another task.

2.2. End to end semantic segmentation 

Fully Convolutional Networks (FCNs) [?] addressed the
semantic segmentation task, which consists of predicting the
semantic label of every individual pixel in the image. This
approach dramatically improved previous results on the
challenging PASCAL VOC segmentation benchmark [?]. The idea
of an end-to-end solution for a 2D problem such as semantic
segmentation was refined by DeepLab-CRF [?], where the
spatial consistency of the predicted labels is checked with
a Conditional Random Field (CRF), similarly to the
hierarchical consistency enforced in [?]. In our work, we
adopt the end-to-end solution for a regression problem
instead of a classification one, and we also introduce a
post-filtering stage, which consists of a Gaussian filter
that smooths the resulting saliency map.
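
The post-filtering stage can be illustrated with a minimal sketch
of separable Gaussian smoothing in plain Python. The standard
deviation of 3.0 is the value reported in Section 3.1; the kernel
truncation at three standard deviations, the replicated borders and
all function names are assumptions made for illustration, not
details taken from the JuntingNet implementation.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalized 1D Gaussian kernel, truncated at 3*sigma by default (assumed)."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    result = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)  # clamp to the image
                acc += w * row[xi]
            new_row.append(acc)
        result.append(new_row)
    return result


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_smooth(image, sigma=3.0):
    """2D Gaussian filter applied as a horizontal pass followed by a vertical pass."""
    kernel = gaussian_kernel(sigma)
    horizontal = smooth_rows(image, kernel)
    return transpose(smooth_rows(transpose(horizontal), kernel))


# A single bright pixel spreads into a smooth blob centred on the same spot.
impulse = [[0.0] * 9 for _ in range(9)]
impulse[4][4] = 1.0
blob = gaussian_smooth(impulse, sigma=1.0)
```

Because the kernel is normalized, a constant map is left unchanged,
while isolated peaks are spread over their neighbourhood.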


3. JuntingNet 

This paper presents JuntingNet, an end-to-end convnet for
saliency prediction. The parameters of our network are
learned by minimizing a Euclidean loss function defined
directly on the ground truth saliency maps.
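
As a rough illustration of the training objective, the sketch below
computes a Euclidean (squared error) loss between a predicted and a
ground truth saliency map, both flattened to lists. Averaging over
the pixels is an assumption; the text does not state how the loss is
normalized, and the function name is hypothetical.

```python
def euclidean_loss(predicted, target):
    """Mean squared difference between two flattened saliency maps."""
    if len(predicted) != len(target):
        raise ValueError("maps must have the same number of pixels")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared) / len(squared)


# Toy 2x2 maps flattened to four values in [0, 1].
predicted = [0.0, 0.5, 1.0, 0.5]
target = [0.0, 1.0, 1.0, 0.0]
loss = euclidean_loss(predicted, target)
```

Since the loss is defined on continuous values, the network is
trained as a regressor and no class labels are involved.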

3.1. Architecture 

The detailed architecture of JuntingNet is illustrated in
Figure 2. The network contains five learned layers: three
convolutional layers and two fully connected layers, which
can also be interpreted as 1x1 convolutions.

The proposed architecture is not very deep compared to
other networks in the state of the art. Popular
architectures trained on the 1,200,000 images of the ILSVRC
2012 challenge range from 7 [?] to 22 layers [?]. JuntingNet
is defined by only 5 layers, which are trained separately on
two training collections of different sizes: 6,000 images
for iSUN and 10,000 for SALICON. This shallow depth is
adopted to limit overfitting, which is a great risk for
models with a large number of parameters, such as convnets.

The detailed description of the convnet stages is the
following:

1. The input volume has a size of [96x96x3] (RGB image),
smaller than the [227x227x3] proposed in AlexNet. Like the
shallow depth, this design choice is intended to reduce the
risk of overfitting.

2. The receptive field of the first 2D convolution is of size
[5x5], and its outputs define a convolutional layer with
32 neurons. This layer is followed by a ReLU activation
layer, which applies an element-wise non-linearity. Then, a
max-pooling layer reduces the spatial size of its input.
Despite the loss of visual resolution at the output, this
reduction also reduces the number of model parameters and
helps prevent overfitting. The max-pooling layer selects the
maximum value of every [2x2] region, taking strides of two
pixels.

3. The output of the previous stage has a size of
[46x46x32]. The receptive field of this second stage
is [3x3]. Again, this is followed by a ReLU layer and
a max-pooling layer of size [2x2].

4. Finally, the last convolutional layer is fed with an
input of size [22x22x64]. The receptive field of this layer
is also [3x3] and it has 64 neurons. A ReLU and a
max-pooling layer are stacked too.

5. A first fully connected layer receives the output of
the third convolutional layer with a dimension of
[10x10x64]. It contains a total of 4,608 neurons.
Figure 2. Convnet architecture for JuntingNet.


6. The second fully connected layer consists of a maxout
layer with 2,304 neurons. The maxout operation [?]
computes the maximum over pairs of the previous layer's
outputs.

7. Finally, the output of the last maxout layer is the
saliency prediction array. The array is reshaped to have
2D dimensions and resized to the stimulus image size.
Finally, a 2D Gaussian filter with a standard deviation
of 3.0 is applied.
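
The sizes quoted in the stages above can be checked with a short
sketch. It assumes unpadded ("valid") convolutions with stride 1,
which is the setting that reproduces the reported 46, 22 and 10
pixel feature maps, and it assumes that the maxout layer pairs
adjacent units. The 48x48 shape of the reshaped output is inferred
from 2,304 = 48 x 48 and is not stated in the text. All function
names are hypothetical.

```python
import math


def conv_out(size, kernel):
    """Spatial size after an unpadded convolution with stride 1 (assumed)."""
    return size - kernel + 1


def pool_out(size, window=2, stride=2):
    """Spatial size after max pooling with a [2x2] window and stride 2."""
    return (size - window) // stride + 1


def maxout_pairs(values):
    """Keep the larger value of each pair of adjacent units (assumed pairing)."""
    return [max(values[i], values[i + 1]) for i in range(0, len(values), 2)]


# Trace the spatial size through the three convolution + pooling stages.
size = 96
trace = [size]
for kernel in (5, 3, 3):
    size = pool_out(conv_out(size, kernel))
    trace.append(size)

# Fully connected stages: 4,608 units reduced to 2,304 by maxout over pairs.
fc_units = 4608
maxout_units = fc_units // 2
side = math.isqrt(maxout_units)  # side of the square map before resizing
```

Under these assumptions the trace is 96, 46, 22 and 10 pixels per
side, and the 2,304 maxout outputs form a 48x48 map that is then
resized to the stimulus size.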

3.2. Training parameters 

The limited amount of training data for our architecture
made overfitting a significant challenge, so we used
different techniques to minimize its effects. Firstly, we
apply norm constraint regularization to the maxout layers
[?]. Secondly, we use data augmentation by mirroring all
images. We also tested a dropout layer [?] after the first
fully connected layer, with a dropout ratio of 0.5 (50%
probability of setting a neuron's output value to zero).
However, this did not make much of a difference, so it is not
included in the final model.
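
The mirroring augmentation can be sketched as follows. The sketch
assumes a horizontal (left-right) flip and assumes that the ground
truth saliency map is flipped together with its image so that both
stay aligned; neither detail is spelled out in the text, and the
function names are hypothetical.

```python
def mirror(rows):
    """Flip a 2D grid (list of rows) left to right."""
    return [list(reversed(row)) for row in rows]


def augment(pairs):
    """Return the original (image, saliency) pairs plus their mirrored copies."""
    augmented = []
    for image, saliency in pairs:
        augmented.append((image, saliency))
        augmented.append((mirror(image), mirror(saliency)))
    return augmented


# Toy single-channel image and its saliency map.
image = [[1, 2, 3], [4, 5, 6]]
saliency = [[0.0, 0.1, 0.9], [0.0, 0.2, 0.8]]
dataset = augment([(image, saliency)])
```

This doubles the number of training pairs without collecting new
annotations.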

The weights in all layers are initialized from a Gaussian
distribution with zero mean and a standard deviation of
0.01, with biases initialized to 0.1. The ground truth
values used for training are saliency maps with values
normalized between 0 and 1.
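
A minimal sketch of this initialization and of the ground truth
normalization is given below. The mean, standard deviation and bias
values come from the text; the min-max scaling used to bring the
maps into the range 0 to 1 is an assumption, as are the function
names and the fixed random seed.

```python
import random


def init_layer(n_inputs, n_units, std=0.01, bias=0.1, seed=0):
    """Weights drawn from N(0, std^2) and constant biases."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, std) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [bias] * n_units
    return weights, biases


def normalize_map(saliency):
    """Min-max scale a 2D saliency map to [0, 1] (assumed normalization)."""
    flat = [value for row in saliency for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in saliency]
    return [[(value - low) / (high - low) for value in row]
            for row in saliency]


weights, biases = init_layer(n_inputs=8, n_units=4)
scaled = normalize_map([[0, 64], [128, 255]])
```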

For validation control purposes, we split the training
partitions of the iSUN and SALICON datasets into 80% for
training and the rest for real-time validation. The network
was trained with stochastic gradient descent (SGD) with
Nesterov momentum, an optimization method that helps the
loss function converge faster. The learning rate changed
over time: it started at 0.03 and decreased during the
course of training until 0.0001. We trained a separate
network for each dataset for 1,000 epochs. Figures 3 and 4
present the learning curves for the iSUN and SALICON models,
respectively.

Figure 3. Learning curves for iSUN models.

Figure 4. Learning curves for SALICON models.
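
The training setup described in this section can be sketched on a
toy problem. The 80/20 split, the learning rate end points (0.03 and
0.0001) and the 1,000 epochs come from the text. The linear shape of
the decay, the momentum value of 0.9, the random shuffling before
the split and all function names are assumptions made for
illustration.

```python
import random


def split_train_val(items, train_fraction=0.8, seed=0):
    """Shuffle and split items into training and validation subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def learning_rate(epoch, n_epochs=1000, start=0.03, end=0.0001):
    """Learning rate for a given epoch, decayed linearly (assumed shape)."""
    if n_epochs <= 1:
        return start
    t = epoch / (n_epochs - 1)
    return start + t * (end - start)


def nesterov_step(w, v, grad_fn, lr, momentum=0.9):
    """One Nesterov momentum update: the gradient is taken at the look-ahead point."""
    lookahead = w + momentum * v
    v = momentum * v - lr * grad_fn(lookahead)
    return w + v, v


# 6,000 iSUN training images split into 80% training and 20% validation.
train, val = split_train_val(range(6000))

# Toy objective f(w) = w^2 with gradient 2w, minimized at w = 0.
w, v = 5.0, 0.0
for epoch in range(1000):
    w, v = nesterov_step(w, v, lambda x: 2.0 * x, learning_rate(epoch))
```

On this toy objective the parameter converges towards zero well
before the last epoch; the real network applies the same kind of
update to every weight using gradients of the Euclidean loss.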
4. Experiments

4.1. Datasets 

The network was tested on the two datasets proposed in
the LSUN challenge [23]:

iSUN [21]: a ground truth of gaze traces on images from
the SUN dataset [20]. The collection is partitioned into
6,000 images for training, 926 for validation and 2,000
for test.

SALICON [12]: cursor clicks on the objects of interest
in images from the Microsoft COCO dataset [16]. The
collection contains 10,000 training images, 5,000 for
validation and 5,000 for test.

4.2. Results 

Our solution is implemented using Python, NumPy and
the deep learning library Theano [3, 2]. Processing was
performed on an NVIDIA GTX 980 GPU with 2048 CUDA
cores and 4GB of RAM. Our network took between six and
seven hours to train on the SALICON dataset, and five to
six hours on the iSUN dataset. Predicting a saliency map
takes 200 ms per image.

We assessed our model on the LSUN saliency prediction
challenge 2015 [23]. Table 1 and Table 2 present our
results for the iSUN and SALICON datasets. The model was
evaluated separately on the test data of each dataset.
The evaluation metrics were adopted from the variety of
metrics provided in the MIT saliency benchmark [13, 4],
defined on both saliency maps and fixation points.
JuntingNet consistently won first place in the challenge
in all the metrics considered. A few qualitative results are
also provided in Figure 5.

5. Conclusions 

We designed the first end-to-end ConvNet for saliency
prediction, trained only with the datasets of visual saliency
provided by the LSUN challenge. With this ConvNet we
were able to win first place in the challenge by a large
margin. Our results demonstrate that ConvNets that are not
very deep are capable of achieving good results on a highly
challenging task.

Our experiments can be considered preliminary, as only one
configuration and setup was considered. We expect that a
more elaborate study of the architecture, use of the dataset
and training parameters could still improve the reported
performance.

The developed model is publicly available from
http://bit.ly/juntingnet.
Figure 5. Saliency maps generated by JuntingNet. The first column corresponds to the input image, the second column to the prediction from
JuntingNet and the third one to the provided ground truth. The first three rows correspond to images from the iSUN dataset,
while the last three are from the SALICON dataset.
Table 1. Results of the LSUN challenge 2015 for saliency prediction with the iSUN dataset.
Table 2. Results of the LSUN challenge 2015 for saliency prediction with the SALICON dataset.